39 research outputs found
Nonparametric Feature Extraction from Dendrograms
We propose feature extraction from dendrograms in a nonparametric way.
Minimax distance measures correspond to building a dendrogram with the single
linkage criterion and defining specific forms of a level function and a
distance function over it. Therefore, we extend this method to arbitrary
dendrograms. We develop a generalized framework wherein different distance
measures can be inferred from different types of dendrograms, level functions
and distance functions. Via an appropriate embedding, we compute a vector-based
representation of the inferred distances, in order to enable many numerical
machine learning algorithms to employ such distances. Then, to address the
model selection problem, we study the aggregation of different dendrogram-based
distances respectively in solution space and in representation space in the
spirit of deep representations. In the first approach, for example for the
clustering problem, we build a graph with positive and negative edge weights
according to the consistency of the clustering labels of different objects
among different solutions, in the context of ensemble methods. Then, we use an
efficient variant of correlation clustering to produce the final clusters. In
the second approach, we investigate the sequential combination of different
distances and features in the spirit of multi-layered
architectures to obtain the final features. Finally, we demonstrate the
effectiveness of our approach via several numerical studies.
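As a minimal sketch of the single-linkage connection stated above: the Minimax distance between two objects (the smallest achievable value of the largest edge on a path between them) equals the level at which single linkage first merges them. The function name `minimax_distances` and the list-of-lists input are assumptions for illustration only; the generalized level and distance functions of the paper are not reproduced here.

```python
from itertools import combinations

def minimax_distances(dist):
    """Pairwise Minimax distances from a symmetric base distance matrix.

    Processes pairs in increasing order of distance (as single linkage does);
    when two clusters merge at level w, every cross pair gets distance w.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}  # cluster id -> objects in it
    cluster_of = list(range(n))           # object -> current cluster id
    minimax = [[0.0] * n for _ in range(n)]
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    for w, i, j in edges:
        a, b = cluster_of[i], cluster_of[j]
        if a == b:
            continue  # already merged at a lower level
        for u in members[a]:
            for v in members[b]:
                minimax[u][v] = minimax[v][u] = w
        if len(members[a]) < len(members[b]):
            a, b = b, a  # merge the smaller cluster into the larger one
        for v in members[b]:
            cluster_of[v] = a
        members[a].extend(members.pop(b))
    return minimax
```

For points on a line at positions 0, 1, 2 and 10, the first three are mutually at Minimax distance 1, while each of them is at distance 8 from the last point.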
Regression and Singular Value Decomposition in Dynamic Graphs
Most real-world graphs are dynamic, i.e., they change over time.
However, while problems such as regression and Singular Value Decomposition
(SVD) have been studied for static graphs, they have not yet been
investigated for dynamic graphs. In this paper, we introduce,
motivate and study regression and SVD over dynamic graphs. First, we present
the notion of update-efficient matrix embedding that defines the
conditions sufficient for a matrix embedding to be used for the dynamic graph
regression problem (under norm). We prove that given an
update-efficient matrix embedding (e.g., adjacency matrix), after an update
operation in the graph, the optimal solution of the graph regression problem
for the revised graph can be computed in time. We also study dynamic
graph regression under least absolute deviation. Then, we characterize a class
of matrix embeddings that can be used to efficiently update SVD of a dynamic
graph. For adjacency matrix and Laplacian matrix, we study those graph update
operations for which SVD (and low rank approximation) can be updated
efficiently.
Sketch-based Randomized Algorithms for Dynamic Graph Regression
A well-known problem in data science and machine learning is linear
regression, which has recently been extended to dynamic graphs. Existing exact
algorithms for updating the solution of the dynamic graph regression problem
require at least a linear time (in terms of : the size of the graph).
However, this time complexity might be intractable in practice. In the current
paper, we utilize the subsampled randomized Hadamard transform and
CountSketch to propose the first randomized algorithms. Suppose that
we are given an matrix embedding of the graph, where .
Let be the number of samples required for a guaranteed approximation error,
which is a sublinear function of . Our first algorithm reduces time
complexity of pre-processing to .
Then after an edge insertion or an edge deletion, it updates the approximate
solution in time. Our second algorithm reduces time complexity of
pre-processing to , where is the number of nonzero elements of . Then after
an edge insertion or an edge deletion or a node insertion or a node deletion,
it updates the approximate solution in time, with
. Finally, we show
that under some assumptions, if our first algorithm
outperforms our second algorithm and if our second
algorithm outperforms our first algorithm.
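The following is a rough illustration of why CountSketch suits dynamic updates, not a reproduction of the algorithms in the paper. It applies a CountSketch to a dense matrix embedding (touching only nonzero entries) and shows that changing one matrix entry changes exactly one entry of the sketch. The helper names `make_countsketch`, `apply_sketch` and `update_entry` are hypothetical, and solving the regression problem on the sketched matrix is omitted.

```python
import random

def make_countsketch(n_rows, n_buckets, seed=0):
    """Draw a random bucket and a random sign for every row of the input."""
    rng = random.Random(seed)
    bucket = [rng.randrange(n_buckets) for _ in range(n_rows)]
    sign = [rng.choice((-1, 1)) for _ in range(n_rows)]
    return bucket, sign

def apply_sketch(A, bucket, sign, n_buckets):
    """Compute S*A: each row of A is added, with its sign, into its bucket row."""
    d = len(A[0])
    SA = [[0.0] * d for _ in range(n_buckets)]
    for i, row in enumerate(A):
        for j, x in enumerate(row):
            if x:  # only nonzero entries cost any work
                SA[bucket[i]][j] += sign[i] * x
    return SA

def update_entry(SA, bucket, sign, i, j, delta):
    """Reflect the change A[i][j] += delta in the sketch with one addition."""
    SA[bucket[i]][j] += sign[i] * delta
```

With an adjacency matrix as the embedding, inserting an undirected edge (u, v) changes two entries of the matrix, so two calls to `update_entry` keep the sketch in sync without recomputing it.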
Modeling Transitivity in Complex Networks
An important source of high clustering coefficient in real-world networks is
transitivity. However, existing approaches for modeling transitivity suffer
from at least one of the following problems: i) they produce graphs from a
specific class like bipartite graphs, ii) they do not give an analytical
argument for the high clustering coefficient of the model, and iii) their
clustering coefficient is still significantly lower than real-world networks.
In this paper, we propose a new model for complex networks which is based on
adding transitivity to scale-free models. We theoretically analyze the model
and provide analytical arguments for its different properties. In particular,
we calculate a lower bound on the clustering coefficient of the model which is
independent of the network size, as seen in real-world networks. Beyond the
theoretical analysis, the main properties of the model are evaluated
empirically, and it is shown that the model can precisely simulate real-world
networks from different domains and with different specifications.
Comment: 16 pages, 4 figures, 3 tables
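The abstract does not spell out the model, so the sketch below is only a generic illustration of "adding transitivity to a scale-free model": preferential attachment in which, with some probability, a new node's next link closes a triangle with a neighbour of its previous target. All names and parameters (`grow_network`, `m`, `p_triad`) are assumptions and are not taken from the paper.

```python
import random

def grow_network(n, m=2, p_triad=0.8, seed=0):
    """Grow an undirected graph with preferential attachment plus triad closure."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(m + 1)}
    for i in range(m + 1):                 # start from a small clique
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
    # A node appears once per incident edge, so uniform sampling from this
    # list is degree-proportional (preferential attachment).
    targets = [v for v in adj for _ in adj[v]]
    for new in range(m + 1, n):
        adj[new] = set()
        last = None
        while len(adj[new]) < m:
            candidates = []
            if last is not None and rng.random() < p_triad:
                # Transitivity step: link to a neighbour of the previous target.
                candidates = sorted(u for u in adj[last]
                                    if u != new and u not in adj[new])
            if candidates:
                pick = rng.choice(candidates)
            else:
                pick = rng.choice(targets)
                if pick == new or pick in adj[new]:
                    continue
            adj[new].add(pick)
            adj[pick].add(new)
            targets.extend((new, pick))
            last = pick
    return adj

def average_clustering(adj):
    """Mean local clustering coefficient (nodes of degree < 2 contribute 0)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)
```

Comparing `average_clustering(grow_network(n, p_triad=0.0))` with a larger `p_triad` for several values of `n` is one way to see the effect of the triangle-closing step empirically.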
Learning representations from dendrograms
We propose unsupervised representation learning and feature extraction from dendrograms. The commonly used Minimax distance measures correspond to building a dendrogram with the single linkage criterion and defining specific forms of a level function and a distance function over it. Therefore, we extend this method to arbitrary dendrograms. We develop a generalized framework wherein different distance measures and representations can be inferred from different types of dendrograms, level functions and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances, in order to enable many numerical machine learning algorithms to employ such distances. Then, to address the model selection problem, we study the aggregation of different dendrogram-based distances respectively in solution space and in representation space in the spirit of deep representations. In the first approach, for example for the clustering problem, we build a graph with positive and negative edge weights according to the consistency of the clustering labels of different objects among different solutions, in the context of ensemble methods. Then, we use an efficient variant of correlation clustering to produce the final clusters. In the second approach, we investigate the combination of different distances and features sequentially in the spirit of multi-layered architectures to obtain the final features. Finally, we demonstrate the effectiveness of our approach via several numerical studies.
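The solution-space aggregation described above can be sketched as follows, under loudly stated assumptions: the edge weight of a pair is taken here as the number of solutions that put the pair together minus the number that separate it, and the final clustering uses a simple deterministic pivot heuristic as a stand-in for the "efficient variant of correlation clustering", which the abstract does not specify. The names `consistency_weights` and `pivot_clustering` are hypothetical.

```python
from itertools import combinations

def consistency_weights(solutions):
    """Signed weight per pair: (#solutions agreeing) - (#solutions disagreeing).

    `solutions` is a list of label lists, one label per object per solution.
    """
    n = len(solutions[0])
    weights = {}
    for i, j in combinations(range(n), 2):
        agree = sum(1 for labels in solutions if labels[i] == labels[j])
        weights[(i, j)] = agree - (len(solutions) - agree)
    return weights

def pivot_clustering(n, weights):
    """Pivot heuristic: take the first unassigned object as pivot and group it
    with every unassigned object connected to it by a positive weight."""
    unassigned = list(range(n))
    clusters = []
    while unassigned:
        pivot = unassigned[0]
        cluster = [pivot] + [j for j in unassigned[1:]
                             if weights[(pivot, j)] > 0]
        clusters.append(cluster)
        unassigned = [j for j in unassigned if j not in cluster]
    return clusters
```

Because only label agreement within each solution is used, the solutions may name their clusters differently without affecting the weights.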
Effectively Counting s-t Simple Paths in Directed Graphs
An important tool in analyzing complex social and information networks is s-t
simple path counting, which is known to be #P-complete. In this paper, we study
efficient s-t simple path counting in directed graphs. For a given pair of
vertices s and t in a directed graph, first we propose a pruning technique that
can efficiently and considerably reduce the search space. Then, we discuss how
this technique can be adjusted with exact and approximate algorithms, to
improve their efficiency. In the end, by performing extensive experiments over
several networks from different domains, we show high empirical efficiency of
our proposed technique. Our algorithm is not a competitor of existing methods,
rather, it is a friend that can be used as a fast pre-processing step, before
applying any existing algorithm.
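The abstract does not describe the pruning technique itself. As an illustration of pruning as a pre-processing step, the sketch below uses a simple and standard reduction: a vertex can lie on an s-t path only if it is reachable from s and can reach t, so all other vertices are dropped before counting. The function names and the adjacency-dict input format are assumptions, and the exact counter is a plain exponential-time depth-first search.

```python
def reachable(adj, source):
    """Set of vertices reachable from `source` by following edges in `adj`."""
    seen = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def prune(adj, s, t):
    """Keep only vertices that are reachable from s and can reach t."""
    rev = {}
    for u, vs in adj.items():
        for v in vs:
            rev.setdefault(v, []).append(u)
    keep = reachable(adj, s) & reachable(rev, t)
    return {u: [v for v in adj.get(u, ()) if v in keep] for u in keep}

def count_simple_paths(adj, s, t):
    """Exact count of simple s-t paths by depth-first enumeration."""
    def dfs(u, visited):
        if u == t:
            return 1
        visited.add(u)
        total = sum(dfs(v, visited) for v in adj.get(u, ())
                    if v not in visited)
        visited.remove(u)
        return total
    return dfs(s, set())
```

The count on the pruned graph equals the count on the original graph, since the removed vertices cannot appear on any s-t path; any exact or approximate counter can be run on the pruned graph instead.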
Discriminative Distance-Based Network Indices with Application to Link Prediction
In large networks, using the length of shortest paths as the distance measure
has shortcomings. A well-studied shortcoming is that extending it to
disconnected graphs and directed graphs is controversial. The second
shortcoming is that a huge number of vertices may have exactly the same score.
The third shortcoming is that in many applications, the distance between two
vertices not only depends on the length of shortest paths, but also on the
number of shortest paths. In this paper, first we develop a new distance
measure between vertices of a graph that yields discriminative distance-based
centrality indices. This measure is proportional to the length of shortest
paths and inversely proportional to the number of shortest paths. We present
algorithms for exact computation of the proposed discriminative indices.
Second, we develop randomized algorithms that precisely estimate average
discriminative path length and average discriminative eccentricity and show
that they give -approximations of these indices. Third, we
perform extensive experiments over several real-world networks from different
domains. In our experiments, we first show that compared to the traditional
indices, discriminative indices usually have much greater discriminability. Then,
we show that our randomized algorithms can very precisely estimate average
discriminative path length and average discriminative eccentricity, using only
a few samples. Then, we show that real-world networks usually have a tiny average
discriminative path length, bounded by a constant (e.g., 2). Fourth, in order
to better motivate the usefulness of our proposed distance measure, we present
a novel link prediction method that uses discriminative distance to decide
which vertices are more likely to form a link in the future, and show its superior
performance compared to the well-known existing measures.
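One natural reading of "proportional to the length of shortest paths and inversely proportional to the number of shortest paths" is the ratio of the two, and the sketch below assumes exactly that; the precise definition in the paper may differ. It computes the ratio from a single source in an unweighted graph with one breadth-first search that also counts shortest paths. The function name and the adjacency-dict input are illustrative assumptions.

```python
from collections import deque

def discriminative_distances(adj, source):
    """For every vertex reachable from `source`, return
    (shortest-path length) / (number of shortest paths)."""
    dist = {source: 0}
    sigma = {source: 1}   # number of shortest paths from the source
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:            # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return {v: dist[v] / sigma[v] for v in dist if v != source}
```

On a 4-cycle, the vertex opposite the source is at distance 2 but is reached by two shortest paths, so its ratio is 1.0, the same as for the direct neighbours. On a path graph the same vertex at distance 2 has ratio 2.0, which shows how the measure separates vertices that plain shortest-path length treats identically.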
Mining Rooted Ordered Trees under Subtree Homeomorphism
Mining frequent tree patterns has many applications in different areas such
as XML data, bioinformatics and the World Wide Web. The crucial step in frequent
pattern mining is frequency counting, which involves a matching operator to
find occurrences (instances) of a tree pattern in a given collection of trees.
A widely used matching operator for tree-structured data is subtree
homeomorphism, where an edge in the tree pattern is mapped onto an
ancestor-descendant relationship in the given tree. Tree patterns that are
frequent under subtree homeomorphism are usually called embedded patterns. In
this paper, we present an efficient algorithm for subtree homeomorphism with
application to frequent pattern mining. We propose a compact data-structure,
called occ, which stores only information about the rightmost paths of
occurrences and hence can encode and represent several occurrences of a tree
pattern. We then define efficient join operations on the occ data-structure,
which help us count occurrences of tree patterns according to occurrences of
their proper subtrees. Based on the proposed subtree homeomorphism method, we
develop an effective pattern mining algorithm, called TPMiner. We evaluate the
efficiency of TPMiner on several real-world and synthetic datasets. Our
extensive experiments confirm that TPMiner always outperforms well-known
existing algorithms, and in several cases the improvement with respect to
existing algorithms is significant.
Comment: This paper is accepted in the Data Mining and Knowledge Discovery
journal
(http://www.springer.com/computer/database+management+%26+information+retrieval/journal/10618)
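To make the matching operator concrete, here is a naive existence check for subtree homeomorphism on rooted ordered trees. It is not the occ data structure or TPMiner: it only tests whether a pattern occurs at least once, assuming that pattern edges map to ancestor-descendant pairs and that the left-to-right order of siblings is preserved. Trees are written as `(label, [children])` tuples, and all function names are hypothetical.

```python
def index_tree(tree):
    """Preorder-number the nodes; the descendants of node i are i+1..end[i]."""
    labels, end = [], []
    def visit(node):
        i = len(labels)
        labels.append(node[0])
        end.append(i)
        for child in node[1]:
            visit(child)
        end[i] = len(labels) - 1
    visit(tree)
    return labels, end

def occurs(pattern, tree):
    """True if `pattern` is embedded in `tree` under subtree homeomorphism."""
    labels, end = index_tree(tree)

    def earliest_end(p, lo, hi):
        """Among valid images of p in positions lo..hi, return the smallest
        subtree end, which leaves the most room for p's right siblings."""
        best = None
        for i in range(lo, hi + 1):
            if labels[i] != p[0]:
                continue
            nxt = i + 1                  # children map to proper descendants of i
            ok = True
            for child in p[1]:
                e = earliest_end(child, nxt, end[i])
                if e is None:
                    ok = False
                    break
                nxt = e + 1              # next sibling lies right of this image
            if ok and (best is None or end[i] < best):
                best = end[i]
        return best

    return earliest_end(pattern, 0, len(labels) - 1) is not None
```

For the tree a(b(c, d), e), the pattern a(c, e) occurs even though c is a grandchild of a, while a(d, c) does not, because it would reverse the left-to-right order of c and d.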